A major issue for me in coming to Python from Matlab was how to save my workspaces. This is especially crucial when finalizing results in support of a manuscript. It is painful when reviewers ask for other statistics or new analyses and everything has to be run over again to address their questions. Also, some analyses take a long time to run. So, how the heck does one save workspace variables to a file in Python? It turns out not to be that difficult. Several established libraries exist for this purpose. One of them, dill, is very good for short-term saves. Others are better for long-term data storage and for sharing with collaborators who are still dependent on Matlab or who use R.
The code below brings in five options for saving your workspace.
By the way, this code was written on PCs running Linux Mint, an Ubuntu variant, and the Python installation was based on Continuum Analytics' Anaconda Python distro.
dill is an extension of Python's pickle module that can save (serialize) most of the common Python datatypes. The saved session depends on the versions of Python and of the libraries installed on the computer that creates the dilled workspace. For me, it is the go-to library when I am working on an analysis on my office PC and need to head out and carry on from my notebook. However, given that dependence on specific Python and library versions, it does not seem like a good choice for long-term data storage.
numpy has a nice function called savez that saves several arrays into a single file, in either uncompressed or compressed format. It is fast to use but tied to Python. Recently, however, an R library called RcppCNPy was written that makes it easy to load and save data in this format from R.
scipy includes functions for reading and writing Matlab version 4 and 5 files, savemat and [loadmat](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.io.loadmat.html). These are very useful, especially if you are using both Python and Matlab or have collaborators stuck on Matlab.
Perhaps the best long-term storage format is hdf. This format is used by the most recent versions of Matlab and can be read directly into GNU Octave. Well-established libraries exist for working with hdf files in R and Julia, and the HDF Group supplies a viewer that makes it easy to check the contents of a file without reading it into Python. I have found two Python libraries, h5py and hdf5storage, useful for working with hdf files in Python. h5py is fast and easy to use. hdf5storage is slower but produces compressed saves by default.
This notebook shows how to use these libraries for saving your workspace in Python. The data set is part of the demo data file provided with NeuroExplorer, written by my grad school lab colleague Alex Kirillov. NeuroExplorer is an excellent tool for working with neurophysiological data files. My lab depends on it.
The first step is to import the relevant libraries.
In [1]:
import dill
import numpy as np
from scipy.io import loadmat, savemat
import h5py
import hdf5storage
Switch folders and load the neuronal data, which were parsed out of a nex file using old Matlab code.
In [2]:
%cd ~/Desktop/Spikes-and-Fields/NEx-demo
In [3]:
NEx_demo = loadmat('SpikesAndFields.mat')
loadmat puts the variables from the Matlab/Octave workspace into a dict.
In [4]:
Keys = NEx_demo.keys()
print(Keys)
My work style is to put each neuron, LFP, or behavioral event into its own variable in the workspace.
In [5]:
Neuron04a = NEx_demo['Neuron04a']
Neuron05b = NEx_demo['Neuron05b']
Neuron05c = NEx_demo['Neuron05c']
Neuron06b = NEx_demo['Neuron06b']
Neuron06d = NEx_demo['Neuron06d']
Neuron07a = NEx_demo['Neuron07a']
Event04 = NEx_demo['Event04']
Event05 = NEx_demo['Event05']
Event06 = NEx_demo['Event06']
ADmat = NEx_demo['AD01'] # LFP data
adfreq = NEx_demo['adfreq'] # sampling frequency
ts = NEx_demo['ts'] # ts is the temporal offset between spikes/events and fields in the Plexon recording file
Clean up a bit.
In [6]:
%xdel NEx_demo
%xdel Keys
Display the arrays in the workspace.
In [7]:
%whos ndarray
Switch to a temporary directory to evaluate saving using the various Python tools.
(I use Dropbox and SpiderOak for backups, but my temp folder is not backed up. I hate wasting bandwidth.)
In [8]:
%cd ~/temp
In [9]:
%time dill.dump_session('test.pkl')
In [10]:
ls -lstr test.pkl
In [11]:
%reset -f
%who
In [13]:
import dill
%time dill.load_session('test.pkl')
%whos
HDF5
Unlike numpy's savez and scipy's savemat, hdf5 needs to know the datatypes of the variables being saved. For this example, ADmat, Event04, and Neuron04a are ndarrays, and adfreq and ts are floats.
In [14]:
%%time
with h5py.File('test.h5', 'w') as hf:
    hf.create_dataset('ADmat', data=ADmat, compression="gzip", shuffle=True)
    hf.create_dataset('adfreq', data=adfreq, compression="gzip", shuffle=True)
    hf.create_dataset('ts', data=ts, compression="gzip", shuffle=True)
    hf.create_dataset('Event04', data=Event04, compression="gzip", shuffle=True)
    hf.create_dataset('Neuron04a', data=Neuron04a, compression="gzip", shuffle=True)
In [15]:
ls -lstr test.h5
All files were verified using the HDF viewer; the file loads directly into Octave and Matlab.
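Reading the arrays back into Python is just as direct with h5py. This is a minimal sketch, not executed in this notebook, assuming the file and dataset names used above; the _check variable names are just for illustration.
In [ ]:
# minimal sketch: read datasets back from test.h5 with h5py
with h5py.File('test.h5', 'r') as hf:
    print(list(hf.keys()))                 # dataset names stored in the file
    ADmat_check = hf['ADmat'][()]          # load the full array into memory
    Neuron04a_check = hf['Neuron04a'][()]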
hdf5storage -- with the default options it is much slower than direct calls to h5py; however, the file is saved more efficiently, and the savings become more apparent with more LFP channels.
In [16]:
# dict is used to set up variables for hdf5storage.writes
vars = {'ADmat':ADmat, 'adfreq':adfreq, 'ts':ts, 'Event04':Event04, 'Neuron04a':Neuron04a}
In [17]:
%%time
hdf5storage.writes(vars, filename='test_hdf5storage.h5')
In [18]:
ls -lstr test_hdf5storage.h5
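For reading these variables back, hdf5storage provides reads, the counterpart of the writes call used above (the file is plain HDF5, so h5py works on it as well). A minimal sketch, not executed here, assuming the filename and variable names used above:
In [ ]:
# minimal sketch: read two variables back with hdf5storage.reads
ADmat_check, adfreq_check = hdf5storage.reads(['/ADmat', '/adfreq'],
                                              filename='test_hdf5storage.h5')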
np.savez
In [19]:
%%time
np.savez('test', ADmat=ADmat, adfreq=adfreq, ts=ts, Event04=Event04, Neuron04a=Neuron04a)
In [20]:
ls -lstr test.npz
This format seems to be stable and standard. A library exists for reading it into R (RcppCNPy: https://cran.r-project.org/web/packages/RcppCNPy/vignettes/RcppCNPy-intro.pdf), so it is useful for passing intermediate files between R and Python, and it could be used as a long-term option, for Python only. However, in my testing the compressed option saved no more than 8%.
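Loading the .npz back in is a one-liner with np.load. A minimal sketch, not executed here, assuming the keyword names used in the savez call above; the _check name is just for illustration.
In [ ]:
# minimal sketch: np.load returns an NpzFile keyed by the names passed to savez
npz = np.load('test.npz')
print(npz.files)            # e.g. ['ADmat', 'adfreq', 'ts', 'Event04', 'Neuron04a']
ADmat_check = npz['ADmat']  # each array is read from the archive on access
npz.close()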
savemat (v5 matlab, from scipy)
In [21]:
%%time
savemat('test.mat', vars)
In [22]:
ls -lstr test.mat
This is a fast way to save data in a format that is easily read into and out of Matlab/Octave.
This format is much faster to write than HDF5, but it is not compressed by default and requires a slow (and complex) library (R.matlab) to be read into R.
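A quick round-trip check in Python, as a sketch (not executed here), using the loadmat function imported at the top of the notebook:
In [ ]:
# minimal sketch: confirm the v5 .mat file round-trips cleanly
check = loadmat('test.mat')
print(np.allclose(check['ADmat'], ADmat))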
clean up
In [23]:
ls -lstr
In [24]:
rm test*.*
In [25]:
ls
feather and bloscpack are other options. bloscpack offers very fast compression, but its GitHub page cautions about long-term stability. The same is true for feather: it is very fast, but again the GitHub page cautions about stability, and the Python implementation currently does not work with row-to-column conversions.